2. Quality Control

# libraries for this chapter
library(tidyverse)
library(Seurat)
library(data.table)

The single-cell dataset may contain low-quality cells, such as those that are dead. These low-quality cells are characterized by low UMI counts, low gene counts, and a high mitochondrial percentage. To filter out these low-quality cells, we can determine appropriate cutoff based on the distribution of these metrics.

1. Sample Multiplexing Kit (SMK)

In an experiment that utilizes the Sample Multiplexing Kit (SMK) for multiplexing samples, it is expected that some cells will be identified as “Multiplets” and “Undetermined”.

Important

Multiplets and Undetermined in SMK

The concept of multiplets in SMK is different from the idea of duplets in which one microwell contains two cells. Multiplets in SMK kits can include both traditional duplets, as well as singlets where cell has two SMK tags.

The Undetermined refers to cell labels that have zero or very low SMK tag reads detected. These can be either noisy cell labels or singlets that are not labeled with SMK tags.

To visualize the number of cells in each SMK tag, we can use a bar plot. By visualizing the number of cells in each SMK tag using a bar plot, we can gain insights into the distribution of cells across different tags.

seuratObj@meta.data %>% 
  group_by(Sample_Name) %>% 
  tally(name = "cell_count") %>% 
  ggplot(aes(x=Sample_Name, y=cell_count, fill= Sample_Name)) + 
  geom_bar(stat = "identity") +
  geom_text(aes(y = cell_count,
                label = cell_count,
                vjust= -0.5, size = 2.5), show.legend = FALSE) + 
  theme_classic() +
  theme(axis.title.x=element_blank(),
        legend.position = "none")

According to the bar plot, most of the cells are either in Sample Tag 01 or Sample Tag 02. However, there are some cells labeled as Multiplet or Undetermined which cannot be assigned to either sample tag. To exclude these Multiplets and Undetermined cells from further analysis, we can use the subset function. This function will allow us to remove the unwanted cells and focus only on the cells belonging to Sample Tag 01 or Sample Tag 02.

seuratObj <- subset(seuratObj, 
                    subset = Sample_Name %in% c("Multiplet", "Undetermined"), 
                    invert = T)

2. Assessing quality metrics

To identify cells of low quality, we will use the quality metrics such as nCount_RNA, nFeature_RNA, and mitochondrial percentage. We will determine which cells should be considered of low quality and exclude them from further analysis.

2.1 nCount_RNA UMI counts per cell

The term nCount_RNA refers to the UMI counts per cell.

We create two plots side by side: a ridge plot and a violin plot. These plots are organized by samples. By observing the ridge plot, we notice the presence of two distinct peaks. This indicates that there are at least two cell populations within the dataset, each with a different transcriptome size.

p1 <- seuratObj@meta.data %>% 
  ggplot(aes(color=Sample_Name, x=nCount_RNA, fill= Sample_Name)) + 
  geom_density(alpha = 0.2) + 
  scale_x_log10() + 
  theme_classic()

p2 <- VlnPlot(seuratObj, features = "nCount_RNA", split.by = "Sample_Name") +
  scale_y_log10()

p1|p2

2.2 nFeature_RNA Gene counts per cell

The term nFeature_RNA refers to the gene counts per cell.

Once again, we generate two plots side by side: a ridge plot and a violin plot. These plots are organized by samples. In our analysis, we focus on the sum of gene counts per cell using the nFeature_RNA metric. Notably, the majority of cells have gene counts above 300.

p1 <- seuratObj@meta.data %>% 
  ggplot(aes(color=Sample_Name, x=nFeature_RNA, fill= Sample_Name)) + 
  geom_density(alpha = 0.2) + 
  scale_x_log10() + 
  theme_classic()

p2 <- VlnPlot(seuratObj, features = "nFeature_RNA", split.by = "Sample_Name") +
  scale_y_log10()

p1|p2

2.3 Mitochondrial percentage

When you create a Seurat object, Seurat automatically generates some metadata for each cell. These metadata are stored in the meta.data. You can use the metadata to filter out low-quality cells. Specifically, the metadata includes two columns: nCount_RNA and nFeature_RNA. However, it does not include information on mitochondrial percentage.

Before proceeding, it’s essential to calculate the mitochondrial percentage for each cell. Fortunately, Seurat provides a convenient function called PercentageFeatureSet that allows you to compute the percentage of UMIs associated with mitochondrial genes in a cell.

seuratObj[["percent.mt"]] <- PercentageFeatureSet(seuratObj, pattern = "^MT-")
Tip

To ensure correct calculation of the mitochondrial percentage, it’s essential to consider the variations in mitochondrial gene names across different genome versions. You can use the grep command to examine the gene name patterns associated with mitochondrial genes.

For instance, in the demo dataset, mitochondrial genes typically begin with the prefix “MT-”. By identifying this consistent pattern, you can confidently calculate the mitochondrial percentage for each cell.

grep(pattern = "mt-", rownames(seuratObj), ignore.case = T, value = T)

The PercentageFeatureSet function accepts a pattern argument and scans through all feature names in the dataset for that specific pattern. In our case, we are interested in mitochondrial genes, so we search for any gene names that begin with the typical pattern associated with mitochondrial genes.

For each cell, the function calculates the sum of counts across all genes that belong to mitochondrial genes. It then divides this sum by the total counts for all genes in that cell. The resulting value is stored in the percent.mt field within the meta.data.

When examining the mitochondrial percentage per cell, we observe that the majority of cells have a mitochondrial percentage below 25%.

p1 <- seuratObj@meta.data %>% 
  ggplot(aes(color=Sample_Name, x=percent.mt, fill=Sample_Name)) + 
  geom_density(alpha = 0.2) + 
  theme_classic()

p2 <- VlnPlot(seuratObj, features = "percent.mt", split.by = "Sample_Name")

p1|p2

2.4 Combine metrics

To determine the filtering threshold, we can combine the following metrics: nCount_RNA, nFeature_RNA, and percent.mt. By considering these factors together, we can make informed decisions about cell quality and inclusion in our analysis.

seuratObj@meta.data %>% 
  ggplot(aes(x=nCount_RNA, y=nFeature_RNA, color=percent.mt)) + 
  geom_point() + 
  scale_colour_gradient(low = "gray90", high = "black") +
  stat_smooth(method=lm) +
  scale_x_log10() + 
  scale_y_log10() + 
  theme_classic() +
  geom_vline(xintercept = 500, colour = "red") +
  geom_hline(yintercept = 300, colour = "red") +
  facet_wrap(~Sample_Name) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

The data is fitted with a linear regression line. We expect that as UMI counts increase, gene counts will also increase.

Quadrants:

  • Upper Right Quadrant: Cells in this quadrant exhibit both high gene counts and high UMI counts. These cells are generally considered to be of good quality.

  • Bottom Right Quadrant: Cells in this quadrant have low gene counts and low UMI counts. These cells are typically considered to be of low quality.

Darker Cells:

  • The darker cells represent cells with a high mitochondrial percentage. Interestingly, many of these cells also have low gene counts. This observation may indicate that these cells are damaged or dying.

  • One possible explanation is that the cytoplasmic mRNA in these cells is leaking out through a broken membrane. As a result, only the mRNA located in the mitochondria remains conserved.

4. Filter out low quality cells

Let’s summarize the quality thresholds for filtering out low-quality cells based on the provided metrics:

nCount_RNA Cells with UMI counts greater than 500 are considered acceptable.

nFeature_RNA Cells with gene counts greater than 300 and less than 5000 fall within the desired range.

percent.mt Cells with a mitochondrial percentage below 25% are preferred.

By applying these thresholds, we can effectively filter out cells that do not meet the quality criteria.

filtered <- subset(x = seuratObj, 
                       subset = (nCount_RNA >= 500) & 
                         (nFeature_RNA >= 300) &
                         (nFeature_RNA <= 5000) &
                         (percent.mt < 25))

5. Re-assess metrics

After applying the filtering criteria, it’s essential to verify that the filtered data aligns with our expectations. To do so, we revisit the quality control (QC) metrics and create plots using the filtered dataset. These updated plots will help us assess the quality of the remaining cells.

p1 <- filtered@meta.data %>% 
  ggplot(aes(color=Sample_Name, x=nCount_RNA, fill= Sample_Name)) + 
  geom_density(alpha = 0.2) + 
  scale_x_log10() + 
  theme_classic()

p2 <- VlnPlot(filtered, features = "nCount_RNA", split.by = "Sample_Name") +
  scale_y_log10()

p1|p2

p1 <- filtered@meta.data %>% 
  ggplot(aes(color=Sample_Name, x=nFeature_RNA, fill= Sample_Name)) + 
  geom_density(alpha = 0.2) + 
  scale_x_log10() + 
  theme_classic()

p2 <- VlnPlot(filtered, features = "nFeature_RNA", split.by = "Sample_Name") +
  scale_y_log10()

p1|p2

p1 <- filtered@meta.data %>% 
  ggplot(aes(color=Sample_Name, x=percent.mt, fill=Sample_Name)) + 
  geom_density(alpha = 0.2) + 
  theme_classic()

p2 <- VlnPlot(filtered, features = "percent.mt", split.by = "Sample_Name")

p1|p2

filtered@meta.data %>% 
  ggplot(aes(x=nCount_RNA, y=nFeature_RNA, color=percent.mt)) + 
  geom_point() + 
  scale_colour_gradient(low = "gray90", high = "black") +
  stat_smooth(method=lm) +
  scale_x_log10() + 
  scale_y_log10() + 
  theme_classic() +
  geom_vline(xintercept = 500, color = "red") +
  geom_hline(yintercept = 300, color = "red") +
  facet_wrap(~Sample_Name)

From the QC plots, we can see low-quality cells from the bottom right quadrant are removed. We can also compare the cell count before and after filtering. The majority of cells are retained even after applying the quality filters.

Before filtering:

seuratObj@meta.data %>% 
  group_by(Sample_Name) %>% 
  dplyr::summarise(cell_number = length(Sample_Name))
# A tibble: 2 × 2
  Sample_Name    cell_number
  <chr>                <int>
1 SampleTag01_hs        2113
2 SampleTag02_hs        2846

After filtering:

filtered@meta.data %>% 
  group_by(Sample_Name) %>% 
  dplyr::summarise(cell_number = length(Sample_Name))
# A tibble: 2 × 2
  Sample_Name    cell_number
  <chr>                <int>
1 SampleTag01_hs        1851
2 SampleTag02_hs        2310